resVAE ensemble: Unsupervised identification of gene sets in multi-modal single-cell sequencing data using deep ensembles

Feature identification and manual inspection is currently still an integral part of biological data analysis in single-cell sequencing. Features such as expressed genes and open chromatin status are selectively studied in specific contexts, cell states or experimental conditions. While conventional analysis methods construct a relatively static view on gene candidates, artificial neural networks have been used to model their interactions after hierarchical gene regulatory networks. However, it is challenging to identify consistent features in this modeling process due to the inherently stochastic nature of these methods. Therefore, we propose using ensembles of autoencoders and subsequent rank aggregation to extract consensus features in a less biased manner. Here, we performed sequencing data analyses of different modalities either independently or simultaneously as well as with other analysis tools. Our resVAE ensemble method can successfully complement and find additional unbiased biological insights with minimal data processing or feature selection steps while giving a measurement of confidence, especially for models using stochastic or approximation algorithms. In addition, our method can also work with overlapping clustering identity assignment suitable for transitionary cell types or cell fates in comparison to most conventional tools.


Introduction
A plethora of tools have been developed over the years for the analysis of high-throughput single-cell data (scRNA-seq, scATAC-seq, CITE-seq etc.) and made great contributions towards enabling the exploration and analysis of the omics landscape of different tissues and organs at an unprecedented resolution. Yet regardless of the tools or methods, human intervention is required to validate or further elucidate the hypotheses through the identification of markers or characteristic features (expressed genes, open chromatin peaks etc.) for the different cell populations clustered by cell type, perturbation, or other conditions and labels. Conventional methods and tools primarily perform highly variable genes (HVGs) or differentially expressed genes (DEGs) analyses based on the different clusters through statistical testing that often generate an extensive list of genes that are ranked according to their logarithmic fold change values or significance scores (Robinson et al., 2010;Love et al., 2014;Ritchie et al., 2015). These include tools such as Monocle (Cao et al., 2019), Scanpy (Wolf et al., 2018) and Seurat (Stuart et al., 2019) for scRNA-seq data, as well as ArchR (Granja et al., 2021), Signac (Stuart et al., 2021) and snapATAC (Fang et al., 2021) for scATAC-seq data. These lists are often cut on arbitrary values that are deemed meaningful in the different clusters within the studied biology, where the reduced lists of features still contain overwhelming numbers of genes even after manual curation. Removing outliers or specific features through prior knowledge could improve the analysis by restricting the scope at the cost of introducing certain biases. A complementary systematic approach could retain potentially interesting or novel features in a less biased manner.
These conventional analysis techniques backed by sound statistical tests and methods remain the gold standard for well-established tasks they excel at. However, machine or deep learning methods that can complement the nature of biological data are increasingly being applied as the complexity of these single-cell data increases. In contrast to statistical methods that look at genes in isolation, methodologies involving the use of deep learning techniques are context-aware and can deal with non-linearities. Moreover, as we gain access to new technologies and techniques that make new data collection possible, it is even more crucial to ensure that these data can be studied and analyzed using similar tools. In particular, the characteristics of variational autoencoders (VAE) make them suitable to address some of the most common tasks in analyzing single-cell sequencing datasets, especially when it involves the comparison or identification of cell populations (Way et al., 2020). A typical approach is to compare individual or a group of these cells and identify the features that characterize the populations. This can be performed by VAEs, where the model learns from the input data and identifies features that characterize the given annotations, which can consist of cell-state or cell-type identities, experimental batches, clinical metadata or perturbations etc. (Lotfollahi et al., 2019).
However, neural networks are non-deterministic during training, as they make use of random partitioning of data, initialization of weights, sampling in each iteration, regularization in terms of dropouts etc. (Scardapane and Wang, 2017). This randomness can contribute to robustness and improve the performance of these models, at the cost of making it difficult to identify consistent features as no two neural networks will be alike even if trained with the same parameters on the same data. Simpler models can usually reach convergence, but becomes challenging with more complex datasets or models, especially when the rankings of the outputs are taken into consideration. Thus, even if the identified features overlap across the different runs, their rankings are almost certainly not guaranteed to be consistent.
We previously introduced the restricted latent variational autoencoder (resVAE) neural network architecture that enables the decomposition of single-cell transcriptomic data into populationbased features in a hierarchical manner that closely resemble biological systems (Lukassen et al., 2020). Our intention is to provide a supportive tool for additional knowledge mining on top or as a part of the user's preferred analysis method. Here, we introduce resVAE ensemble with the use of rank aggregation to better address the consistency and confidence of feature identification in single-cell analysis. In contrast to conventional tools that focus on discrete cluster assignment, resVAE ensemble can be used on cell populations with transitory or partial cluster assignments. Most importantly, the introduction of deep ensembles allowed us to measure the consistencies or confidence metrics of the extracted features as determined by the different models.

Results
With an ensemble, multiple models can be combined to produce better results than what an individual model alone could achieve. Meanwhile, rank aggregation allows us to retain the ranking information from the different ranked results and combine them into a single ranking. We trained multiple resVAE models on the same inputs to learn the cluster-specific features independently. Next, we combined the outputs of all the models and introduced different rank aggregation algorithms that could reduce them into a single ranked consensus, which results in more robust and consistent outputs as compared to using only a single model ( Figure 1A). Furthermore, we also showed that resVAE ensemble can be applied on various singlecell data modalities such as simulated data, scRNA-seq transcript counts data and scATAC-seq peaks data to identity features for further analysis ( Figures 1B, C).

Advantages of resVAE ensemble on simulated dataset
To demonstrate that resVAE ensemble can identify features relevant to the experiment based on the input data and annotations, we first performed the analysis on a simulated synthetic dataset used to model myeloid differentiation in hematopoiesis as previously described by Krumsiek et al., 2011. In this dataset, common myeloid progenitors (Prog) are simulated to branch off into four different cell fates with distinct expression profiles, namely megakaryocytes (Mk), erythrocytes (Ery), granulocytes (Granu) and monocytes (Mono). In this simulated experiment, resVAE ensemble was performed on the obtained synthetic counts data and annotated labels to identify features that characterize the different cell fates.
The UMAP of this data ( Figure 2A) shows the different cell fates populations, where progenitors can be seen in the middle of the figure and branch off into the different main cell fates. In our findings, resVAE ensemble was able to identify the corresponding transcription factors that are relevant for each of these cell populations ( Figure 2B). For instance, in the Erythrocytes and Megakaryocytes clusters, resVAE ensemble identified EKLF and Fli-1 as the defining feature for each of these clusters, respectively, as expected. Additional findings that conformed to experimental data were recapitulated too, such as GATA-1 being identified in both Erythrocytes and Megakaryocytes clusters. Similarly, features in the Monocytes cluster were identified consistently, with PU.1, cJun, EgrNab and C/EBPα standing out in comparison to the rest. In the Granulocytes cluster, resVAE ensemble identified Gfi-1 as the definitive feature, though it did not seem to consider PU.1 and C/EBPα. The consistency of the various trained models can be visualized using a parallel coordinate plot, where highly consistent clusters display thick uniform lines and vice versa ( Figure 2C).
Additionally, to demonstrate the advantages of the ensemble, we also investigated the different resVAE models within the ensemble individually. While the resVAE model is quite robust in general, it is inevitable that models can get stuck at some local minima or overfit during training where major differences and variations between individual resVAE models can be observed ( Figure 2D). On the The variational autoencoder based resVAE architecture allows each label to have its own label-specific latent space for label-specific features identification. (A) shows the overall design of the resVAE ensemble workflow. (B) shows the use case of the resVAE ensemble methodology that can be applied to various forms of single-cell data with their corresponding cluster identities, including simulated data, scRNA-seq counts and scATAC-seq peaks. (C) highlights the overview of this manuscript. We highlighted the application of the resVAE ensemble methodology on simulated data and single-cell sequencing datasets of different modalities to identify features that could be used for further analysis.
Frontiers in Cell and Developmental Biology frontiersin.org 03

FIGURE 2
Results of resVAE ensemble on two simulated datasets. (A-D) correspond to the simulated PBMC dataset, while (E-I) correspond to the simulated bifurcation model dataset. (A) shows the UMAP of the cells from the simulated myeloid differentiation dataset provided in Scanpy (Krumsiek et al., 2011;Wolf et al., 2018). Frontiers in Cell and Developmental Biology frontiersin.org one hand, we have Model 0 ( Figure 2D, left) that did not manage to associate the EKLF feature with the Erythrocytes cluster as well as performing badly for the other clusters. On the other hand, Model 10 ( Figure 2D, right) successfully identified relevant features for all the different clusters. With resVAE ensemble, these events can be detected and rescued by producing a consensus that showed reliable results ( Figure 2B, right), where most of the identified features are valid and relevant, even if a few might be misidentified in some cases. Moreover, we also compared the performance of the different models trained with a combination of different hyperparameters (Supplementary Figure S1) in the Erythrocytes and Granulocytes clusters as an example. 10 individual models for each selected hyperparameters combination were trained separately to illustrate the differences that can arise from the training, even if the inputs and the hyperparameters are kept identical. In the Erythrocytes cluster, we observed that the different ReLU models tend to vary more, while models with a different activation function (Mish) hyperparameter are more consistent (Supplementary Figure S1). The Granulocytes cluster seems to pose a challenge for resVAE ensemble, as the different models display much more variety in comparison to the Erythrocytes cluster (Supplementary Figure S1). Interestingly, models with different hyperparameters seem to be able to identify different features better, for example the Mish activation function models identified SCL, FOG-1 and GATA-1 much more consistently in the Erythrocytes cluster and Gfi-1 in the Granulocytes cluster, in comparison to the ReLU models.
In summary, with this experiment we showed that resVAE can extract meaningful features from the simulated dataset, while the ensemble can aid in identifying and selecting features that are consistently identified. We also note that resVAE ensemble was able to identify these features without performing highly variable or differential expression comparisons.
Advantages of resVAE ensemble that enable novel cluster assignment to cells on trajectory data with transitory cell populations Next, we demonstrated a major advantage of resVAE ensemble that enables the use of soft or partial cluster assignments in addition to the hard or discrete cluster assignment that is performed routinely. Essentially, cells can be assigned partial identities consisting of mixed archetypes as would be expected in transitory cell states or developing cell types.
To demonstrate this, we constructed a synthetic trajectory data with interdependent hierarchical features simulating the roles of transcription factors-regulated gene expressions and investigated the performance of resVAE ensemble on this dataset (Supplementary Figure S2). In this bifurcation model, cells are simulated to transition from an initial state sA and fork into one of two distinct populations sC and sD under the influence of antagonistic regulators ( Figure 2E). While a total of seven populations were defined in our simulation, we also show that it is possible to determine the optimal number of clusters using resVAE by measuring different clustering metrics (Supplementary Figure S3). Here, resVAE reported an optimal cluster number between 3 and 6 based on various scoring metrics, which seems reasonable based on the UMAP ( Figure 2E). We then investigated and compared the performance of resVAE ensemble on two distinct (soft and hard) clustering and analysis methods.
Our findings in this simulated dataset again demonstrated that resVAE ensemble can identify population-specific features regardless of the cluster assignment methods. resVAE ensemble performed comparably using either method based on clustering evaluation metrics (Silhouette Coefficient of 0.52 vs. 0.55; Calinski-Harabasz score of 377.44 vs. 305.11; Davies-Bouldin score of 0.75 vs. 0.71). The main differences are most apparent in populations that are more heterogeneous and less consistent in terms of expression profile. This is most evident in sBmid which is not only very close and similar to sB, but would also incorporate similar cells belonging to the sC and sD populations in the soft clustering method ( Figure 2E). In clusters that are more well defined such as sEndC, the structures of the trained resVAE decoders' weights mappings are very consistent and highly overlapping regardless of the assignment method ( Figure 2F, top). Meanwhile, the structure of the sBmid population is visibly disharmonized and misaligned in the models trained on the two different cluster assignment methods, indicating that the learned features are more different ( Figure 2F, bottom). In the hard clustering-derived discrete clusters, these identified features are mostly exclusive to their own corresponding clusters, though some features are shared across clusters that are close to each other, as seen with clusters sA, sB and sBmid ( Figure 2G, left). This is largely applicable to the soft clustering-derived clusters as well, though there are slightly more shared features identified due to the mixed or partial identities of some cells ( Figure 2G, right).
In addition, resVAE ensemble successfully recapitulated meaningful results as defined in the transcription factors of the simulation ( Figure 2H). For example, among the different module-specific transcription factors, the initialization of the sA state relies solely on the Burn1-4, B4 and B5 modules. This was captured strongly in the discrete cluster models ( Figure 2H, top), as resVAE ensemble was able to identify the four Burn modules as well as the B4 and B5 modules specifically. Most of the differences in the soft cluster assignment method can be attributed to the fuzzier sB and sBmid populations ( Figure 2H, bottom). The partial cluster models can better separate the effects of the A and B modules in the sBmid populations. Curiously, resVAE ensemble identified the D6 module in the sBmid population in the partial cluster models ( Figure 2H, bottom), which could be explained by the proximity of a group of sD cells in the sBmid cluster. In the hard clustering method, this nuance is lost as these cells would not be included in the sBmid cluster. Regardless of the clustering methods used, the modules for the post-fork populations (sC, sD, sEndC, sEndD) were successfully identified, where the strong antagonizing interactions of these features can be observed. In both models, the B2, B6, B9, B10 and B11 modules that push sBmid to transition towards sC are identified in the sC and sEndC populations as expected. Similarly, the B3, B7, B12, B13 and B14 modules that push sBmid towards sD are likewise identified correctly. In addition, B8 that is present in both branches are identified, while terminal populations (sEndC and sEndD) tend to have stronger emphasis on their population-specific modules in comparison to the preterminal populations (sC and sD).
Regardless of the cluster assignment methods used, resVAE ensemble consistently identified not only features that can distinctly separate dissimilar populations, but also features that can be used to distinguish similar sub-clusters. Unsurprisingly, the features identified by resVAE ensemble can be affected by the fluidity of the clusters. In the case of hard clusters assignment, the identified features are more pronounced in the relevant clusters, whereas in the case of partial or mixed identities, the identified Frontiers in Cell and Developmental Biology frontiersin.org Frontiers in Cell and Developmental Biology frontiersin.org 06 features are not as pronounced and tend to bleed into similar clusters ( Figure 2I). Depending on the homo-or heterogeneity of the cells within their defined clusters, the number of features identified by resVAE ensemble could decrease or increase accordingly as the compositions of the clusters vary (Supplementary Figure S4).
In short, we successfully demonstrated that resVAE ensemble can be used with both discrete and mixed or partial cell identities that could characterize transitory cell types or cell states more accurately.

Evaluating resVAE ensemble in real interferon β-stimulated PBMCs biological dataset
To further investigate and demonstrate the performance of resVAE ensemble in gene sets identification tasks in actual biological data, we next applied the methodology on interferon (IFN) β-stimulated human peripheral blood mononuclear cells (PBMCs) scRNA-seq counts data ( Figure 3A), which are mostly immune cells consisting of T cells, B cells, monocytes etc. (Kang et al., 2018).
To assess the performance of resVAE ensemble in elucidating cluster identities, we artificially split all the clusters into two partitions of equal sizes ( Figure 3A). In total, 30 artificial cell type clusters were generated from 15 original cell type clusters. While not exactly comparable, under-or over-clustering occurs frequently during analysis depending on the granularity of the annotations. We systematically compared the different clusters and their identified genes by performing hypergeometric tests ( Figure 3B). The results showed that neither the large numbers nor the similarities between the clusters affected the performance of resVAE ensemble.
In fact, we were able to identify consistent features that are relevant to the clusters. This is most prominent between the two partitions of the CD14 Mono cluster ( Figure 3C, left) that highlighted an example of significant overlap with high scores, showing that artificially split clusters of the same origin not only yield highly overlapping identified genes in resVAE ensemble, but also relatively low or balanced non-overlapping genes. Here, resVAE ensemble identified 103 overlapping genes with only 15 genes that are not shared. These overlapping genes highly enrich for terms related to immune activities such as neutrophil activation, inflammatory response, cytokine-mediated signaling, cellular response to chemokine etc. This is exemplary considering the two partitions are not entirely identical and that some of the exclusive genes (CD48, CTSC, CYBB, RTN4 etc.) have immune-related functions as well. Since the CD14 Monocytes constitute the largest population of cells in this dataset, resVAE ensemble was able to better learn the characteristics of these cells and identify genes more confidently.
Next, we also compared the genes identified by resVAE ensemble in the CD14 Monocytes 1 cluster versus the Erythrocytes 1 cluster ( Figure 3C, middle), which is an example of two dissimilar clusters overlapping with a lower score and sparse number of overlapping ranked genes. Of the 137 genes involved, these two dissimilar clusters only share 5 overlapping identified ranked genes (IER3, IL1B, LIMS1, NEAT1 and TALDO1). This contrast between almost identical (CD14 Mono 1 and CD14 Mono 2) versus highly dissimilar clusters (CD14 Mono 1 and Eryth 1) can be observed in the Andrews curves plot (Figure 3D), where the CD14 Mono 1 (magenta) and CD14 Mono 2 (lime) lines fully overlapped, versus the misalignment of the CD14 Mono 1 (magenta) and Eryth 1 (mustard) lines. Once again, this confirmed that resVAE ensemble can consistently distinguish between these different cell populations and identify genes that are relevant in the cluster-specific context.
Similarly, resVAE ensemble considers the CD16 Mono 1 cluster to be closely related to the Monocyte/Megakaryocyte Doublets 2 cluster, where almost all the genes identified in CD16 Monocytes are encompassed in the latter cluster ( Figure 3C, right). Excluding these shared CD16 Monocytes cluster genes from the Monocyte/ Megakaryocyte Doublets cluster yielded many genes that enrich for immune terms related to neutrophils, inflammatory responses, cytokines etc. Intriguingly, resVAE ensemble did not identify any shared or common genes between the actual Megakaryocytes cluster and the Monocyte/Megakaryocyte Doublets cluster (Supplementary Figure S5), suggesting that resVAE considers them as vastly different clusters which is also apparent from the UMAP plot ( Figure 3A). However, some identified genes could be functionally linked to Megakaryocytes, such as platelet function (CD9, FLNA, TLN1, PLA2G7, SERPING1, APLP2, ANXA5, MARCKS etc.), cell adhesion (THBS1, LGALSs, FERMT3 etc.), cell motility (TPM4 etc.) and mediators of immune relevance (IL1B, IL8, CXCL1, CXCL2, TGFBI etc.) (Cunin and Nigrovic, 2019;Kammers et al., 2021). resVAE also managed to identify other biologically meaningful genes missed by Seurat, including a cell surface receptor on B cells involved in isotype switching (CD40; Bishop and Hostager, 2001), a cell adhesion molecule expressed on T cells that is a well-known marker for naïve and memory cells (SELL; Yang et al., 2011), a proinflammatory cytokine expressed on NK cells (IL32; Yang et al., 2019) and more ( Figure 3E).
With this experiment, we further demonstrated that resVAE ensemble works on real scRNA-seq biological datasets, where the identified features correspond to the cluster identities in terms of biology and cell-type markers (Supplementary Figure S6). Clusters that are similar will have similar profiles and vice versa, thus potentially enabling a systematic approach to elucidate cluster identities. In addition, we showed that the performance of resVAE ensemble conforms to the clusters in a context-aware manner such that the number of clusters involved and their similarities did not seem to negatively affect the performance. resVAE not only identified genes consistently between the partitions, but also identified some genes that would be similarly identified using conventional methods such as Seurat and MAST (McDavid et al., 2015) as well ( Figure 3F). Most importantly, we also showed that resVAE ensemble can capture biologically meaningful genes that could be missed by other methods (Supplementary Figure S7).
Explorative analysis using resVAE ensemble on scRNA-seq and scATAC-seq multi-modal data Next, we aimed to demonstrate that resVAE ensemble can be used directly on other single-cell data modalities such as scATAC-seq. To that end, we investigated the performance of resVAE ensemble on genes and peaks or open-chromatin region identification tasks in human PBMC scRNA-seq and scATAC-seq dataset respectively ( Figure 4A). In contrast to single-cell RNA sequencing techniques that profile the transcriptomic landscape of single cells by capturing and measuring the RNA transcripts as a proxy to infer gene or protein expressions, single-cell ATAC sequencing techniques profile the epigenomic regulatory information of single-cells by capturing open regions in the chromatin accessible to regulators. One such possibility is to investigate transcription factor binding motifs that are  Figure 4B). In both scRNA-seq and scATAC-seq datasets, resVAE ensemble was able to consistently identify features that are relevant to the corresponding clusters ( Figures 4C, D). The results also suggest that the identified genes and peaks are biologically meaningful, as clusters with similar identities or functions share more features in common and vice versa. For example, in the scRNA-seq dataset, the two B cells clusters-pre-B and B cell progenitor-have some identified features in common, while most of the T cells populations are also similarly clustered together based on the identified genes. Comparable results can be observed in the scATAC-seq dataset as well, where common peaks were identified in the Monocytes and B cells populations. Encouragingly, the two NK cells clusters were observed to share both genes and peaks not only between themselves but also with the CD8 Effector T cells that function similarly (Narni-Mancinelli et al., 2011) ( Figures 4C, D). Apart from common features, resVAE ensemble also identified distinct cluster-specific features, with the two CD14 Monocytes clusters being the most apparent considering the minimal connectivity between them ( Figure 4C, bottom; Figure 4D, right). Meanwhile, a good balance between shared and distinct features was reached in the subpopulations of the B cells and NK cells clusters. In the case of scRNA-seq data, most clusters share a minimal number of genes, whereas clusters that do, share a considerable number of them, as shown among the NK cells and CD8 Effector T cells clusters, as well as the B cells clusters ( Figure 4C, top). This is largely recapitulated in the scATACseq data where peaks were identified instead of genes ( Figure 4C, bottom; Figure 4D, right). In addition, one of the CD14 Monocytes cluster seemed to share many peaks with the CD16 Monocytes cluster. Regardless, resVAE ensemble certainly identified more peaks that are shared in the scATAC-seq data in comparison to shared genes in the scRNA-seq data. In the biological context, the increase in shared features can be expected since different peaks could be mapped into one or several annotations and vice versa. Instead of identifying individual peaks that were considered distinct, the common genes that each distinct peak can be mapped to can now be identified and merged.
A brief cursory exploratory analysis into the enrichment results using the peaks identified in the NK CD56Dim cluster revealed some biologically meaningful results ( Figure 4E). For instance, the peaks identified in both the CD14 Monocytes and NK cells clusters yielded enrichment terms that are highly suggestive of their cell type identities based on the Human Molecular Signature Database's (MSigDB) "C8: cell type signature" gene sets (Liberzon et al., 2011). The identity of the NK cells cluster is obvious, while the identity of the Monocytes can be inferred after taking into consideration the samples studied (myeloid cells), and the nonmonocytes terms such as microglia and Kupffer cells being closely related to monocytes or macrophages in terms of their function. Overall, the performance of resVAE ensemble here is within expectations, as PBMCs are remarkably similar in terms of their associated gene sets or enriched terms, not to mention the indirect interpretation at the open chromatin level.
Nevertheless, we successfully demonstrated the use of resVAE ensemble on joint or multi-modal analyses by analyzing scRNAseq genes and scATAC-seq peaks of PBMCs with biological meaning. We also explored the regions identified by resVAE ensemble, which seemed to be relevant to the clusters in terms of the enriched transcription factor binding motifs or the closest genes that could be linked to these regions. In general, the consistencies of the identified features for specific clusters are affected by the number of cells in the cluster, where clusters with larger number of cells exhibit higher consistencies. We remark that we used the default hyperparameters for the resVAE models, even when the number of features in this scATAC-seq data with more than 87,560 peaks is more than 4.6-fold of the scRNA-seq data with 19,000 genes. Remarkably, the default encoder and decoder layers with 256 and 512 neurons seemed to be sufficient without having to scale up the model. These results also further suggested that the resVAE ensemble model is robust against hyperparameters tuning, making it feasible to get started immediately and have acceptable results to explore straightaway without excessive tweaking or benchmarking.

Discussion
In this manuscript we detailed the introduction and implementation of a methodology combining deep ensembles with rank aggregation focusing on our resVAE architecture. resVAE ensemble is a methodology that implements rank aggregation of deep ensembles on the resVAE architecture that allows the capturing of cluster-specific hierarchical features in single-cell data. The introduction of deep ensembles with rank aggregation improves the consistencies and provides a measurement of confidence for the identified features. To be clear, we neither claim that resVAE ensemble itself is superior to other existing tools, nor intend for this to be a benchmarking study, since the methodology can be used alongside other reported methods. In addition, it is also not especially straightforward to directly compare and benchmark tools aimed at different use cases or niches without a good scoring metric or ground truths. This proves to be difficult when it involves context-specific biological interpretations, even more so without a strong confidence in terms of experimental or background knowledge. The aim of identifying relevant genes in a cell population also raises the question about how populations of mixed cell-types or cell doublets are treated, and whether they can be detected by resVAE ensemble. Our experiments showed that this may be achieved by splitting the cluster in question into several partitions and then comparing the identified genes among these partitions. The ratio of overlapping genes identified by resVAE ensemble between identical partitions should be much lower if the cluster consists of doublets or varied cell populations as demonstrated (Supplementary Figure S8).
Nevertheless, we observed that the features identified by resVAE ensemble generally resemble the HVG/DEGs from statistical testingbased methods like Seurat or Scanpy. However, the candidates identified by resVAE could be less biased in comparison to manual curation based on prior knowledge of what to expect, as they are not just restricted to the top genes of these methods. Here, we claim it could be less biased and interesting if we did not have to perform this curation at all, and let the network learn in a systematic manner instead.
The deep ensemble is most appropriate for methodologies utilizing approximation or probabilistic models that produce ranked results where the results are non-deterministic across different runs. Currently, ensembles of deep learning models are not commonly used. Even in classical ensembles such as random forest, the features are selected without taking rank aggregation into consideration. Here, with rank aggregation it is possible to measure the confidence or consistency of the identified features within the ensemble as we have shown. Alternatively, the deep ensemble can also be used to aggregate models with varying hyperparameters, especially when the best hyperparameters set are not known. Naturally, one can also perform equal or weighted aggregation of ensembles across methodologies to combine the results of different tools in a meaningful way.
Here we will also briefly discuss similar reported tools. The pathway module VAE (pmVAE) also utilizes VAEs in the form of pre-selected pathway modules that only act on the participating genes to encourage module independence (Gut et al., 2021). Hence, it can be used to determine the weights of each of these pre-specified pathway modules across each cell in the dataset. This is different by nature in comparison to resVAE that leverages any labeling information to retrieve features in a hierarchical manner. scArches is another family Frontiers in Cell and Developmental Biology frontiersin.org of tools with a bigger aim of tackling dataset integration by mapping queries to reference datasets (Lotfollahi et al., 2022). While it is possible to extract and cobble various parts of scArches together to achieve similar functionalities provided by resVAE, it neither addresses the same needs directly nor utilizes ensembles and rank aggregation.
Currently, clustering analysis is the standard in single-cell data analyses, where cells within the same cluster are treated in a homogenous manner. Archetypal analysis could be another option especially for transitionary cell types or states, where populations are described in terms of combinations of archetypes. We have shown that resVAE ensemble can be used in a similar fashion in the bifurcation model experiment, where populations were assigned by their similarities to specific archetypes. scAAnet is another tool that utilizes autoencoder for archetypal analysis by optimizing the archetypal constraint in the shared latent space (Wang and Zhao, 2022). Interestingly, the authors of scAAnet also remarked on the stochasticity of such deep learning methods. Moreover, they must train and evaluate multiple models manually to ensure that the outputs are stable. This is where we believe our approach with resVAE ensemble to be superior and should be highlighted, since we leveraged the inherent stochasticity by incorporating the consistencies and rankings of the outputs to yield more consistent results.
In terms of data analysis, we demonstrated that resVAE ensemble can be used on different single-cell data modalities, including scRNAseq and scATAC-seq. In addition, we also showed that resVAE ensemble can be used to identify features for joint analysis of multi-modal data in combination with other tools. resVAE ensemble can further be used with cluster assignment methods that assign partial or mixed identities for the different clusters, which cannot be usually performed in most conventional tools used to analyze these data.
Overall, we introduced rank aggregation of ensembles that improved the consistency and reproducibility of the results. In fact, we believe that the ensemble and rank aggregation approach introduced can be applied to other similar methods to potentially yield better results. Moreover, it is possible to combine the results from resVAE with the results from other methodologies in the ensemble, whether they are widely used end-to-end analysis workflows or niche tools targeting specific use cases or even upcoming tools.

Conclusion
Conventional methodologies used for single-cell data predominantly focus on comparing highly variable or differentially expressed features across different populations. Deep learning or approximation-based algorithms can be used to identify features in single-cell data in a hierarchical manner in line with biological systems. However, these methods can produce different results due to their inherent randomness. Here, we used deep ensembles on our resVAE architecture to improve the reliability and consistency of the results. Moreover, resVAE ensemble can give a measure of confidence or consistency for the identified features. Finally, this method can be used to integrate results from different methodologies and algorithms to produce a consensus result.

Materials and methods resVAE ensemble
resVAE ensemble is built on top of the restricted latent variational autoencoder (resVAE), which we introduced in our earlier publication (Lukassen et al., 2020). Apart from the introduction of ensemble with rank aggregation, we also further optimized our resVAE architecture and implementation. The ensemble is constructed by combining multiple independent resVAE models trained using identical inputs with varying hyperparameters. The cost function of the resVAE ensemble network is inherited from the base resVAE architecture, with minor adjustments through the addition of two adjustable parameters α and β targeting the reconstruction loss and Kullback-Leibler (KL) divergence respectively: In short, the cost function of the resVAE architecture consists of two parts, namely the reconstruction loss and the KL-divergence, respectively. The reconstruction loss measures the mean-squared error between the original inputs and the outputs of the autoencoder, which in our case is usually the cell by feature matrix. The KL divergence is a measure to quantify the similarity between probability distributions. Here, σ and μ represent the variance and mean of the latent distribution, respectively, while s refers to the latent offset hyperparameter. The addition of the α term allows us to prioritize the reconstruction accuracy. Meanwhile, the β term transforms the regular VAE into a β-VAE, where the β parameter can be tweaked to encourage the model to balance between prioritizing latent disentanglement or reconstruction accuracy (Higgins et al., 2016). The β-VAE is identical to a normal VAE when β α 1, while higher β value would emphasize the statistical independence represented by the KL divergence rather than the reconstruction accuracy, and thus implicitly enforcing the de-correlation of the learned latent spaces. In general, we find that higher α or β can produce better results in terms of the identified features by providing a better balance between reconstruction accuracy and latent disentanglement or through a better loss landscape as shown in our experiments (Supplementary Figure S1). We also introduced the use of label smoothing for the one-hot encoded labels, which has been shown to improve model generalization and calibration in many state-of-the-art models and tasks by preventing the network from becoming over-confident (Müller et al., 2019).
Weights initialization methods can play a critical role in training neural networks to convergence (Kumar, 2017). While rectified linear unit (ReLU) seemed to make more sense in terms of how gene expression works in biological organisms, the gradient vanishing and/or exploding problem remains an issue when it comes to training neural networks. This problem can cause neurons to become inactive and never recover during training. In practice, this could contribute to a less stable result, making it difficult to both assess the performance of the models and to obtain systematic reproducible results (Lu et al., 2019). Here we settled on the adoption of the Mish activation function in the current version of resVAE ensemble. The Mish activation function has been benchmarked extensively to demonstrate considerable improvements pertaining training Frontiers in Cell and Developmental Biology frontiersin.org stability over ReLU and other activation functions in various models and tasks (Misra, 2019).
Apart from the implementation of new activation functions for the non-linearities, we also improvised the kernel initialization method of the dense layers based on the activation function used. Now, the He Uniform kernel initialization method is paired with ReLU activations, while the Xavier or Glorot Uniform method is paired with hyperbolic tangent-based activations, which should be a better pairing as demonstrated in multiple studies and experiments (He et al., 2015;Datta, 2020).
Moreover, we adopted the combination of Rectified Adam (RAdam) with the Lookahead mechanism as our optimizer. The original Adam optimizer with its adaptive learning rate is known to suffer from bad convergence problems especially during the early stage of training, where the variance is undesirably large due to the limited amount of training samples. RAdam attempts to address this by introducing a dynamic rectifier term that adjusts the adaptive momentum of Adam by stabilizing the variance of the adaptive learning rate while avoiding the need for manual warmup (Liu et al., 2019). Meanwhile, the Lookahead mechanism allows a faster convergence by interpolating between two sets of weights that are iteratively updated during the exploration and training process (Zhang et al., 2019). The combination of these two methods has been shown to improve the learning stability with minimal computation and memory costs, with the main benefit being its robustness without the need for further hyperparameters tweaking. Nevertheless, the training time and speed will be affected by varied factors, such as the size of the datasets, the hyperparameters used etc. Using the INF-β stimulated PBMC dataset as an example, one resVAE model took around 6 minutes to train on an NVIDIA A100, and we could train several of these models in parallel either on the same GPU or across multiple GPUs.
While we primarily demonstrated the use of ensemble on identical inputs with varying combinations of α and β, more complex hyperparameters combinations and ensemble setups are possible, especially in combination with outputs from different tools.

Gene list cut-off
To further improve the gene candidates produced from the cut-offs in situations where it is difficult to identify definite elbow and knee points from the rank plots, a new cut-off calculation method is implemented. First, the data is split into N bins of equal sizes. In practice, we are usually interested in the point at which there is a pronounced drop (knee) or rise (elbow) in the weights, which are usually located in the first and the last bins, respectively. Hence, the knee point is calculated by connecting the median data point of the first bin to its first data point to obtain the rotation angle θ , such that the x -axis can be rotated to be parallel with this line and the point where y is maximum is used as the knee point. Conversely, the elbow point is calculated using the θ obtained by connecting the median data point of the N th bin to its last data point and taking the adjusted point where y is minimum. This new method allows better fine-tuning or control over the cut-offs, where the number of genes can be further increased or decreased where appropriate. The median of the different cut-offs is used as the cut-off point for the final aggregation for each label-specific list.
Here, we have an example where the new method would improve the performance by producing more consistent cut-offs and restricting the number of features selected (Supplementary Figure S9). In these cases, using the original method would yield less consistent cut-off points and result in the inclusion of excess features (Supplementary Figure S9). Here, the difference would be 3,000 extra genes candidates in the scRNA-seq dataset to more than 10,000 extra peaks in the scATAC-seq dataset.

Rank aggregation
We implemented the Robust Ranking Aggregation (RRA) algorithm (Kolde et al., 2013) in Python for use with resVAE ensemble. In this manuscript, all rank aggregations are performed using the RRA algorithm to produce one consensus ranked feature list that is used for further analyses. The meta-analysis by information content (MAIC) algorithm that can aggregate ranked and unranked lists (Li et al., 2020) is implemented as well and can be used if the ranked lists should be assigned different weights.

Data analysis
The human PBMC scRNA-seq and scATAC-seq datasets for the analysis were obtained from 10x Genomics as detailed in the Data Availability section. The myeloid differentiation simulated dataset is included in Scanpy (Krumsiek et al., 2011;Wolf et al., 2018), while the bifurcation model simulated data is generated using dyngen (Cannoodt et al., 2021). This simulation model consists of 7 cell populations with 35 transcription factors regulating 500 target genes and an additional 20 housekeeping genes that are not regulated by the transcription factors. The analyses were performed using different tools where they are applicable, including Seurat (Stuart et al., 2019), Scanpy (Wolf et al., 2018), Signac (Stuart et al., 2021) and resVAE (Lukassen et al., 2020) ensemble to assess and compare the results. Gene set enrichment analysis was performed using Metascape (Zhou et al., 2019) or Enrichr (Xie et al., 2021). scATAC-seq peaks were also analyzed using the GREAT algorithm (Tanigawa et al., 2022) via rGREAT (Gu, 2022), which assigns biological meaning to noncoding genomic regions by analyzing the annotations of the nearby genes.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors. The code for our implementation of resVAE ensemble is available on GitHub (https://github.com/labconrad/resVAE-ensemble).

Author contributions
FWT, SL, and CC conceived the project. FWT and SL designed the experiments. FWT implemented the algorithms and trained the models. FWT performed the analyses with contributions from SL and DY. YJP and NJ contributed advice on the technical details. SL, Frontiers in Cell and Developmental Biology frontiersin.org

SUPPLEMENTARY FIGURE S3
Determining cluster number using resVAE. Various scores and errors of clustering metrics are shown, where the numbers above the bars indicate the number of clusters generated for the different resolutions. The Silhouette (higher better), Calinski-Harabasz (higher better) and Davies-Bouldin (lower better) scores are standard clustering metrics that measure how well-defined the clusters are, while the Dice errors (lower better) measure how similar the different clusters are in terms of overlapping features. The non-zero mean Dice errors include only groups with overlapping features for calculations. The general trends persist regardless of the beta value, with a higher beta emphasizing the de-correlation of the different clusters. The red arrowheads highlight the two best scores of the corresponding metrics.

SUPPLEMENTARY FIGURE S4
The chord diagram illustrates the differences and overlaps of identified features between the two different cluster assignment methods. The H and S prefixes correspond to the "hard" and "soft" cluster assignment methods, respectively.

SUPPLEMENTARY FIGURE S5
An overview of the genes identified by resVAE ensemble and the extent of their co-occurences across all clusters.

SUPPLEMENTARY FIGURE S6
Heatmaps of all resVAE-identified genes in the different clusters, highlighting that resVAE can identify population-characterizing genes instead of emphasizing cluster-specific or differentially expressed genes. The numbers at the bottom-left corners indicate the number of genes resVAE identified for that specific cluster. The ratios at the bottom-right corners indicate the number of cell-type specific markers from PanglaoDB (Franzén et al., 2019) identified by resVAE, against the total number of markers annotated in the database for that cluster. The colored bars above the columns indicate the corresponding clusters. Cell-types that are ambiguous, without resVAE-exclusive genes or absent from PanglaoDB are excluded.

SUPPLEMENTARY FIGURE S7
The complete dot plot showing the expressions of all PanglaoDB cell-type markers identified by resVAE but missed by Seurat. An outlined dot indicates that the gene is specified as a marker for the corresponding clusters (columns). The red outline additionally indicates that the gene has been reported to be relevant in the context of the corresponding clusters in existing literature. The Mono/Mk Doublets cluster is a special case where outlined genes were indeed reported to be relevant in either Monocytes or Megakaryocytes cell-type, but left without a red outline. Other cell-types that are ambiguous, without resVAE-exclusive genes or absent from PanglaoDB are excluded and marked in light gray.

SUPPLEMENTARY FIGURE S8
Identification of mixed cell-types or doublet clusters using resVAE ensemble. The top bars show the number of genes identified by resVAE ensemble for the different clusters. The bottom bars indicate the ratio of genes identified in common between the two partitions of the same cluster. Mixed cell-types clusters are synthetically generated by randomly sampling cells from the specified clusters and assigned the same label. UMI doublets are synthetically generated by randomly sampling two UMIs and summing up their counts. These synthetic clusters are then split into two partitions to compare their genes identified by resVAE ensemble. We observed that these synthetic clusters tend to exhibit a lower ratio of overlapping identified genes between the two partitions, especially if they consist of fuzzier or more heterogenous populations. Synthetic clusters are marked in black, the ambiguous Mk/Mono Doublets in dark gray, and the rest in light gray.

SUPPLEMENTARY FIGURE S9
(A, B) show the comparisons between the original and the improved cut-off calculation method in scRNA-seq and scATAC-seq data, respectively.

SUPPLEMENTARY TABLE S1
Links for data used in this study.
Frontiers in Cell and Developmental Biology frontiersin.org